AITopics | self-attention mechanism

Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across various tasks in natural language processing, computer vision, time-series modeling, and other domains. However, one of the challenges with deep Transformer models is the oversmoothing problem, where representations across layers converge to indistinguishable values, leading to significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective. We propose a graph-filter-based self-attention (GFSA) that learns a general yet effective graph filter, at the cost of a complexity slightly larger than that of the original self-attention mechanism. We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph-level tasks, speech recognition, and code classification.
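
The graph-filter reading of self-attention mentioned in the abstract can be illustrated with a minimal sketch. It assumes NumPy, treats the row-stochastic attention matrix as a graph adjacency matrix, and applies a generic polynomial graph filter to it. The function names and the filter coefficients are hypothetical and chosen only for illustration; this is not presented as the GFSA formulation from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_matrix(Q, K):
    # Standard scaled dot-product attention weights (rows sum to 1).
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d), axis=-1)

def polynomial_graph_filter(A, coeffs):
    # Generic polynomial filter: sum_k coeffs[k] * A^k, with A^0 = I.
    H = np.zeros_like(A)
    power = np.eye(A.shape[0])
    for w in coeffs:
        H = H + w * power
        power = power @ A
    return H

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

A = attention_matrix(Q, K)

# Plain self-attention is the first-order filter H = A applied to V.
plain = A @ V

# A higher-order filter mixes I, A and A^2 (coefficients are illustrative).
filtered = polynomial_graph_filter(A, [0.2, 1.0, -0.3]) @ V
```

With coefficients `[0.0, 1.0]` the filter reduces to ordinary self-attention, which is the sense in which the original mechanism is a "simple" graph filter.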

artificial intelligence, natural language, proceedings, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

eeae43a68515325cad64c0f54b2d0c70-Paper-Conference.pdf

Neural Information Processing Systems | Feb-18-2026, 15:28:34 GMT

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country: Asia > China > Hong Kong (0.04)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Diagnostic Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
(2 more...)

Designing Robust Transformers using Robust Kernel Density Estimation

Neural Information Processing Systems | Feb-16-2026, 09:04:30 GMT

Transformer-based architectures have recently exhibited remarkable successes across different domains beyond just powering large language models. However, existing approaches typically focus on predictive accuracy and computational cost, largely ignoring certain other practical issues such as robustness to contaminated samples. In this paper, by re-interpreting the self-attention mechanism as a non-parametric kernel density estimator, we adapt classical robust kernel density estimation methods to develop novel classes of transformers that are resistant to adversarial attacks and data contamination. We first propose methods that down-weight outliers in RKHS when computing the self-attention operations. We empirically show that these methods produce improved performance over existing state-of-the-art methods, particularly on image data under adversarial attacks. Then we leverage the median-of-means principle to obtain another efficient approach that results in noticeably enhanced performance and robustness on language modeling and time series classification tasks. Our methods can be combined with existing transformers to augment their robust properties, thus promising to impact a wide variety of applications.
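
The kernel view of self-attention referred to in the abstract can be checked numerically with a small sketch. It assumes NumPy, a Gaussian kernel, and keys normalized to unit length; under those assumptions a Nadaraya-Watson kernel estimator (a ratio of kernel sums) coincides with softmax attention. The function names and the bandwidth choice are illustrative assumptions, and the sketch does not implement the robust estimators proposed in the paper.

```python
import numpy as np

def softmax_attention(Q, K, V, scale):
    # Softmax attention with logits q.k / scale.
    logits = Q @ K.T / scale
    logits = logits - logits.max(axis=1, keepdims=True)
    W = np.exp(logits)
    return (W / W.sum(axis=1, keepdims=True)) @ V

def nadaraya_watson(Q, K, V, sigma2):
    # Gaussian-kernel weighted average of the values:
    # weight_ij is proportional to exp(-||q_i - k_j||^2 / (2 * sigma2)).
    sq_dist = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(axis=-1)
    logw = -sq_dist / (2.0 * sigma2)
    logw = logw - logw.max(axis=1, keepdims=True)
    W = np.exp(logw)
    return (W / W.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
n, d = 5, 8
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Assumption: all keys share the same norm, so the ||k||^2 term is constant
# across keys and the ||q||^2 term is constant within each row; both cancel
# in the normalization, leaving exp(q.k / sigma2).
K = K / np.linalg.norm(K, axis=1, keepdims=True)

sigma2 = np.sqrt(d)  # bandwidth matching the usual 1/sqrt(d) scaling
out_attn = softmax_attention(Q, K, V, scale=np.sqrt(d))
out_kernel = nadaraya_watson(Q, K, V, sigma2=sigma2)
```

Because the attention output is a kernel-weighted average, a single contaminated key or value can shift it, which is the kind of sensitivity that robust kernel density estimation techniques are designed to reduce.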

large language model, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > Texas > Travis County > Austin (0.04)
(2 more...)

Genre: Research Report > Promising Solution (0.66)

Industry:

Information Technology > Security & Privacy (0.68)
Government > Military (0.54)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Middle-Out Decoding

Shikib Mehri, Leonid Sigal

Neural Information Processing Systems | Feb-12-2026, 05:29:57 GMT

Neural Information Processing Systems http://nips.cc/

decoder, proceedings, sequence, (14 more...)

Neural Information Processing Systems

Country:

North America > Canada > British Columbia (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.69)
Information Technology > Artificial Intelligence > Cognitive Science (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

bd31bfd4caa85bffe07a35568182cdfa-Paper-Conference.pdf

Neural Information Processing Systems | Feb-11-2026, 16:11:28 GMT

agent, coordination pattern, factorization, (14 more...)

Neural Information Processing Systems

Country:

North America > Canada > Alberta (0.14)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > China > Tianjin Province > Tianjin (0.04)
(2 more...)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

be3e9d3f7d70537357c67bb3f4086846-Supplemental.pdf

Neural Information Processing Systems | Feb-10-2026, 23:13:49 GMT

A maximum of 20K generations is specified in the training, but training is stopped early if the performance has converged. We consider two possible approaches when we take sample-efficiency into consideration. A.4.2 PyBullet Ant. In the PyBullet Ant experiment, we demonstrated that a pre-trained policy can be converted into a permutation invariant one with behavior cloning (BC). We give a detailed task description and experimental setup here. The second, larger policy is similar in architecture, but we added one more FC layer and expanded all hidden sizes to 128 to increase its expressiveness.

artificial intelligence, operation, self-attention mechanism, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.30)

20e45668fefa793bd9f2edf19be12c4b-Paper-Conference.pdf

Neural Information Processing Systems | Feb-7-2026, 20:54:49 GMT

arxiv preprint arxiv, attention weight, explanation, (15 more...)

Neural Information Processing Systems

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Intriguing Properties of Vision Transformers

Neural Information Processing Systems | Dec-24-2025, 21:12:59 GMT

Vision transformers (ViT) have demonstrated impressive performance across numerous machine vision tasks. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. An important question is how such flexibility (in attending image-wide context conditioned on a given patch) can facilitate handling nuisances in natural images, e.g., severe occlusions, domain shifts, spatial permutations, adversarial and natural perturbations. We systematically study this question via an extensive set of experiments encompassing three ViT families and provide comparisons with a high-performing convolutional neural network (CNN). We show and analyze the following intriguing properties of ViT: (a) Transformers are highly robust to severe occlusions, perturbations and domain shifts, e.g., they retain as high as 60% top-1 accuracy on ImageNet even after randomly occluding 80% of the image content.

intriguing property, name change, vision transformer, (8 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.58)

self-attention mechanism

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

28e4ee96c94e31b2d040b4521d2b299e-Paper-Conference.pdf

Graph Convolutions Enrich the Self-Attention in Transformers!